Marketing Campaign Analysis¶

Executive Summary¶

Key Takeaways¶

We propose using Hierarchical Clustering; with our model, we segmented the customers into 6 clusters. Here is a quick summary of each cluster:

  • Cluster 0 is a large group of high-income, educated adults with few or no children and relatively high spending.
  • Cluster 1 is made up of middle-class families with one or more children. They look to get the most bang for their buck.
  • Cluster 2 is made up of single parents with high education.
  • Cluster 3 is the largest cluster, with very low purchasing power: the least education and the lowest income.
  • Cluster 4 is our highest-spending cluster, made up of highly educated customers with the highest incomes and no children. These are the customers most receptive to marketing campaigns.
  • Cluster 5 is the smallest cluster, made up of the customers who have complained in the last 2 years.

Through our EDA and data preprocessing, we calculated a few extra metrics, including total household size, average purchase price, total spent on needs, and total spent on luxuries. We can see that the majority of customers in our system make very few purchases, skewing our data to the right. We have a long, thin tail for spending on necessities but a more substantial tail for spending on luxuries. Not surprisingly, our most influential metric appears to be income.

What are the next steps for the company?¶

Use our model to tailor your marketing strategy for each cluster.

  • Spend resources on growing cluster 4. Use its profile to run targeted ads on social media and bring in more customers who fit cluster 4, because they tend to spend the most money.
  • Cluster 0 is our bread-and-butter cluster. We need to emphasize our loyalty program with them, make sure they feel welcomed and appreciated, and offer loyalty bonuses like tote bags and our store credit card.
  • Cluster 1 will be our most price-elastic customers, so we should give targeted coupons and send targeted ads via social media or email for sales on items we want to move, focused mostly on necessities and family-related products.
  • Cluster 2 is receptive to marketing campaigns, so we can send out general campaigns to them, but I wouldn't focus a lot of resources on them specifically.
  • Clusters 3 and 5 we can mostly ignore in our marketing efforts. Cluster 3 is large but doesn't have much potential, and cluster 5 is very small; we can handle their issues through customer support.

Problem Definition¶

The Context:¶

  • Why is this problem important to solve?

In sports, a divide has grown between the up-and-coming data analytics crowd and the old-school scouts. The old-school scouts say that analytics don't account for the "Eye Test," i.e., how an athlete performs visually versus how they appear in the stats. The issue is that's not even the question we should be asking; the right question is "how can analytics improve the eye test?" A trained scout's eye can see many things that statistics will miss, but the power of statistics is that stats "watch" every single game. That's the problem we're trying to solve right now. We need to understand each and every one of our customers to utilize our resources efficiently. We, as analysts, can look at demographics and see what kinds of people buy what kinds of products, but there are many underlying trends that humans can't see. We need algorithms that can comb through the data and identify patterns in customer behavior.

The objective:¶

  • What is the intended goal?

The intended goal is to analyze our customer data in order to segment the customers into different clusters. We will then be able to profile each cluster to understand how each kind of customer interacts with our brand.

The key questions:¶

  • What are the key questions that need to be answered?

Who are our customers overall? What are the general characteristics of our customer base? Which variables are most important in predicting customer behavior? What additional information can we extract from our dataset? How many clusters should we segment into? What kind of customer makes up each cluster and how do those customers interact with our brand?

The problem formulation:¶

  • What is it that we are trying to solve using data science?

We are using data science to find the patterns in our customers' behavior and to find similar customers who might be likely to behave the same way. Using algorithms to identify similar customers will be far easier than playing a 2240-piece game of Memory trying to match customers up. Data science will also be able to identify the most important variables and determine the weight they carry in our data.

Problem and Solution Summary¶

Problem¶

The overarching problem we are trying to solve is how to better serve our customers and attract more, with the ultimate goal of growing our business. We have data about our customers and their actions; we need to better understand who our customers are and how they behave so that we can use our resources most effectively to serve them.

Proposed Solution Steps¶

  1. Explore the data. Use visualizations to better understand the customer distributions in our different variables and identify outliers.
  2. Clean the data: drop irrelevant variables and filter out the extreme outliers. We will also do feature engineering in this step to consolidate our data into more relevant and understandable metrics.
  3. Use PCA to consolidate our variables even further. Use various metrics to understand the Principal Components and determine the features we should use to make our clusters.
  4. Use the PCA features selected to cluster. We will try multiple algorithms to see which one makes the best clusters. The algorithms we will try will include K-Means, K-Medoids, Hierarchical Clustering, DBScan, and Gaussian Mixture Modeling.
  5. Examine the different clusters from each algorithm and determine their profiles based on their demographics.
  6. Give our recommendation for which algorithm to use and what the profiles of that cluster are.
  7. Provide action steps to capitalize on the insights from our model.

Our solution will provide you with a full understanding of your customer base. You will have detailed profiles of customer archetypes and their behaviors. This will allow you to target your marketing efforts to different clusters, run A/B testing to see which types of ads work on which types of customers, and minimize wasted resources.
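The pipeline in the steps above can be sketched end to end. This is a minimal illustration on synthetic data (make_blobs standing in for the engineered customer features), not the notebook's actual run; the cluster count of 4 here is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the engineered customer features (step 2)
X, _ = make_blobs(n_samples=400, centers=4, n_features=8, random_state=1)

# Step 3: scale, then reduce dimensionality with PCA
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2, random_state=1).fit_transform(X_scaled)

# Steps 4-6: fit competing algorithms and compare cluster quality
models = {
    'KMeans': KMeans(n_clusters=4, n_init=10, random_state=1),
    'Hierarchical': AgglomerativeClustering(n_clusters=4),
}
for name, model in models.items():
    labels = model.fit_predict(X_pca)
    print(f'{name}: silhouette = {silhouette_score(X_pca, labels):.3f}')
```

The same scale-reduce-cluster-score loop extends naturally to K-Medoids, DBSCAN, and Gaussian Mixture Models from step 4.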


Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruit products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Import the necessary libraries and load the data¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from scipy.spatial.distance import cdist, pdist

import warnings
warnings.filterwarnings("ignore")
In [2]:
data = pd.read_csv('marketing_campaign.csv')
data_copy = data.copy()

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
  • Dropping the rows which have null values
In [3]:
data.shape
Out[3]:
(2240, 27)
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
In [5]:
# Convert to date time
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format ='%d-%m-%Y')
In [6]:
# Check to see if there are duplicates
data[data.duplicated()]
Out[6]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response

0 rows × 27 columns

In [7]:
# Drop the rows with null values in the income column
data_no_null = data.dropna()
data_no_null.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2216 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ID                   2216 non-null   int64         
 1   Year_Birth           2216 non-null   int64         
 2   Education            2216 non-null   object        
 3   Marital_Status       2216 non-null   object        
 4   Income               2216 non-null   float64       
 5   Kidhome              2216 non-null   int64         
 6   Teenhome             2216 non-null   int64         
 7   Dt_Customer          2216 non-null   datetime64[ns]
 8   Recency              2216 non-null   int64         
 9   MntWines             2216 non-null   int64         
 10  MntFruits            2216 non-null   int64         
 11  MntMeatProducts      2216 non-null   int64         
 12  MntFishProducts      2216 non-null   int64         
 13  MntSweetProducts     2216 non-null   int64         
 14  MntGoldProds         2216 non-null   int64         
 15  NumDealsPurchases    2216 non-null   int64         
 16  NumWebPurchases      2216 non-null   int64         
 17  NumCatalogPurchases  2216 non-null   int64         
 18  NumStorePurchases    2216 non-null   int64         
 19  NumWebVisitsMonth    2216 non-null   int64         
 20  AcceptedCmp3         2216 non-null   int64         
 21  AcceptedCmp4         2216 non-null   int64         
 22  AcceptedCmp5         2216 non-null   int64         
 23  AcceptedCmp1         2216 non-null   int64         
 24  AcceptedCmp2         2216 non-null   int64         
 25  Complain             2216 non-null   int64         
 26  Response             2216 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(23), object(2)
memory usage: 484.8+ KB

Observations and Insights from the Data overview:¶

  • We have 2240 total rows and 27 columns, including the ID column. We are missing 24 values in the Income column, so we need to drop those 24 rows.
  • Dt_Customer was read in as an object, but it is the date the customer joined our system, so we need to transform it with to_datetime.
  • We only have 2 categorical variables; everything else is numerical. Accepted campaigns, complaints, and responses are binary yes/no variables, so those will need to be treated a little differently.
  • There are variables that seem unnecessary to differentiate between, like Kidhome and Teenhome, so we can combine them later.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What are the summary statistics of the data? Explore summary statistics for both the numerical and the categorical variables.
  2. Find out the number of unique observations in each category of the categorical columns. Write your findings/observations/insights.
  3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?
In [82]:
# Drop the ID Column, it will not be helpful in our analysis
data_dropped = data_no_null.drop(columns = 'ID')

# Identify all the numeric columns
num_cols = data_dropped.select_dtypes(include=['number'])

# Isolate the binary response columns
num_cols_no_response = num_cols.iloc[:,:16]
response_cols = num_cols[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4',
       'AcceptedCmp5', 'Complain', 'Response']]

# Distinguish the categorical columns
cat_cols = data_dropped.select_dtypes(include=['object'])
In [83]:
# Check summary statistics of numeric columns
round(data_dropped.describe().T,2)
Out[83]:
count mean std min 25% 50% 75% max
Year_Birth 2216.0 1968.82 11.99 1893.0 1959.0 1970.0 1977.00 1996.0
Income 2216.0 52247.25 25173.08 1730.0 35303.0 51381.5 68522.00 666666.0
Kidhome 2216.0 0.44 0.54 0.0 0.0 0.0 1.00 2.0
Teenhome 2216.0 0.51 0.54 0.0 0.0 0.0 1.00 2.0
Recency 2216.0 49.01 28.95 0.0 24.0 49.0 74.00 99.0
MntWines 2216.0 305.09 337.33 0.0 24.0 174.5 505.00 1493.0
MntFruits 2216.0 26.36 39.79 0.0 2.0 8.0 33.00 199.0
MntMeatProducts 2216.0 167.00 224.28 0.0 16.0 68.0 232.25 1725.0
MntFishProducts 2216.0 37.64 54.75 0.0 3.0 12.0 50.00 259.0
MntSweetProducts 2216.0 27.03 41.07 0.0 1.0 8.0 33.00 262.0
MntGoldProds 2216.0 43.97 51.82 0.0 9.0 24.5 56.00 321.0
NumDealsPurchases 2216.0 2.32 1.92 0.0 1.0 2.0 3.00 15.0
NumWebPurchases 2216.0 4.09 2.74 0.0 2.0 4.0 6.00 27.0
NumCatalogPurchases 2216.0 2.67 2.93 0.0 0.0 2.0 4.00 28.0
NumStorePurchases 2216.0 5.80 3.25 0.0 3.0 5.0 8.00 13.0
NumWebVisitsMonth 2216.0 5.32 2.43 0.0 3.0 6.0 7.00 20.0
AcceptedCmp3 2216.0 0.07 0.26 0.0 0.0 0.0 0.00 1.0
AcceptedCmp4 2216.0 0.07 0.26 0.0 0.0 0.0 0.00 1.0
AcceptedCmp5 2216.0 0.07 0.26 0.0 0.0 0.0 0.00 1.0
AcceptedCmp1 2216.0 0.06 0.24 0.0 0.0 0.0 0.00 1.0
AcceptedCmp2 2216.0 0.01 0.11 0.0 0.0 0.0 0.00 1.0
Complain 2216.0 0.01 0.10 0.0 0.0 0.0 0.00 1.0
Response 2216.0 0.15 0.36 0.0 0.0 0.0 0.00 1.0
In [84]:
# Check summary statistics of categorical columns
print(cat_cols.describe(),'\n')
print(cat_cols.value_counts())
         Education Marital_Status
count         2216           2216
unique           5              8
top     Graduation        Married
freq          1116            857 

Education   Marital_Status
Graduation  Married           429
            Together          285
            Single            246
PhD         Married           190
Master      Married           138
Graduation  Divorced          119
PhD         Together          116
Master      Together          102
PhD         Single             96
2n Cycle    Married            80
Master      Single             75
2n Cycle    Together           56
PhD         Divorced           52
Master      Divorced           37
2n Cycle    Single             36
Graduation  Widow              35
PhD         Widow              24
2n Cycle    Divorced           23
Basic       Married            20
            Single             18
            Together           14
Master      Widow              11
2n Cycle    Widow               5
PhD         YOLO                2
Master      Alone               1
            Absurd              1
Graduation  Alone               1
            Absurd              1
Basic       Widow               1
PhD         Alone               1
Basic       Divorced            1
dtype: int64
In [85]:
# Replace 2n cycle with master
data_dropped.loc[data_dropped['Education'] == '2n Cycle', 'Education'] = 'Master'

# Replace all alone/YOLO/Absurd data as single
data_dropped.loc[((data_dropped['Marital_Status'] == 'Alone') | 
             (data_dropped['Marital_Status'] == 'YOLO')|
            (data_dropped['Marital_Status'] == 'Absurd')), 'Marital_Status'] = 'Single'
In [86]:
cat_cols_cleaned = data_dropped.select_dtypes(include=['object'])
print(cat_cols_cleaned.describe(),'\n')
print(cat_cols_cleaned.value_counts())
         Education Marital_Status
count         2216           2216
unique           4              5
top     Graduation        Married
freq          1116            857 

Education   Marital_Status
Graduation  Married           429
            Together          285
            Single            248
Master      Married           218
PhD         Married           190
Master      Together          158
Graduation  Divorced          119
PhD         Together          116
Master      Single            113
PhD         Single             99
Master      Divorced           60
PhD         Divorced           52
Graduation  Widow              35
PhD         Widow              24
Basic       Married            20
            Single             18
Master      Widow              16
Basic       Together           14
            Widow               1
            Divorced            1
dtype: int64

Observations and Insights from the Data overview:¶

  • I don't know how they got this data, but answers like YOLO, Absurd, and Alone aren't helpful, so I combined all of those into Single.
  • I did the same with 2n Cycle, combining those into Master because it's a post-grad degree but not a PhD, which is an important distinction.
  • There is a lot of variance in many variables, sometimes with the standard deviation larger than the mean.
  • The most common education level amongst our customers is Graduation, and the most common relationship status is Married.

Univariate Analysis on Numerical and Categorical data¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

  • Plot histograms and box plots for the different numerical features and understand what the data looks like.
  • Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. How does the distribution of Income variable vary across the dataset?
  2. The histogram and the box plot are showing some extreme values on the right side of the distribution of the 'Income' feature. Can we consider them outliers and remove them, or should we analyze these extreme values?
  3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not treat) them? At what percentile does the upper whisker lie?
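One way to answer question 3 concretely is to compute the upper whisker as Q3 + 1.5·IQR and check what share of customers fall below it. This is a sketch on a fabricated income sample (the real call would use data_dropped['Income']); the two appended extremes mimic the outliers visible in the box plot:

```python
import numpy as np

# Fabricated income sample standing in for data_dropped['Income'],
# with a couple of extreme values like the ones seen in the box plot
rng = np.random.default_rng(1)
income = np.append(rng.normal(52000, 22000, 2200).clip(1500), [160000, 666666])

# Upper whisker of a standard box plot: Q3 + 1.5 * IQR
q1, q3 = np.percentile(income, [25, 75])
upper_whisker = q3 + 1.5 * (q3 - q1)

# Percentile at which the whisker lies
pct_below = (income <= upper_whisker).mean() * 100
print(f"upper whisker = {upper_whisker:,.0f}; {pct_below:.1f}% of customers fall below it")
```

If only a fraction of a percent of rows sit above the whisker, dropping them costs little information.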
In [13]:
# Calculate the number of rows needed for the plots (each column needs 2 rows)
n_cols = 2
n_rows = len(num_cols_no_response.columns)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 3))

# Plot box plots and histograms
for i, column in enumerate(num_cols_no_response.columns):
    # Box plot
    sns.boxplot(x=data[column], ax=axes[i, 0])
    axes[i, 0].set_title(f'Box Plot {column}')
    
    # Histogram
    sns.histplot(data[column], kde=False, ax=axes[i, 1])
    axes[i, 1].set_title(f'Histogram {column}')
plt.tight_layout()

# Show the plot
plt.show()
In [122]:
# Drop the major outliers in Income and Year of Birth
data_dropped = data_dropped[(data_dropped['Income'] <= 600000) & (data_dropped['Year_Birth'] > 1915)]
In [15]:
# Define the columns and rows needed to plot the graphs
n_cols = len(cat_cols_cleaned.columns)
n_rows = 1

fig, axes = plt.subplots(n_rows, n_cols, figsize=(10,5))

# Plot box plots and histograms
for i, column in enumerate(cat_cols_cleaned.columns):
    cat_cols_cleaned[column].value_counts().plot(kind='bar', ax=axes[i])
    axes[i].set_title(f'{column} Bar Plot')
    axes[i].set_xlabel('Categories')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
In [16]:
# Create an empty data frame
offers_vcs = pd.DataFrame()

# Create a for loop to add the value counts for customers who interacted with brand marketing campaigns
for column in response_cols.columns:
    offers_vcs[column] = response_cols[column].value_counts()
    
# Plot the totals: the x-axis is the campaign/response column, the y-axis the acceptance count
response_cols.sum().plot(kind='bar')
plt.xlabel('Campaign / Response Column')
plt.ylabel('Number of Acceptances')
plt.title('Number of Times Customers Responded to Offers')

plt.show()
print(offers_vcs)
   AcceptedCmp1  AcceptedCmp2  AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  \
0          2096          2211          2077          2073          2077   
1           144            29           163           167           163   

   Complain  Response  
0      2219      1906  
1        21       334  

Observations and Insights from the Data overview:¶

  • Looking at our histograms and box plots, most of the purchase variables are right skewed, indicating that most of our customers make relatively few purchases, while a few customers make large amounts of purchases.
  • We have some major outliers in Income and Year_Birth. I believe these outliers are mistakes: they are so far out of the norm that I find it very unlikely we have someone making $666k a year, or multiple customers born around 1900.
  • We do have relatively normal distributions in both income and year_birth when not factoring in the unreasonable outliers.
  • Most of our customers are in a relationship whether married or just together and an overwhelming majority of our customers have some type of post-high school degree.
  • The marketing campaigns have a mostly uniform distribution of success except for the 2nd campaign which did far worse than any other campaign.

Bivariate Analysis¶

  • Analyze different categorical and numerical variables and check how different variables are related to each other.
  • Check the relationship of numerical variables with categorical variables.
In [94]:
# Create a heatmap showing the correlation between all numeric variables
plt.figure(figsize  = (12, 10))

sns.heatmap(data_dropped[num_cols_no_response.columns].corr(), annot = False, cmap = "rocket")

plt.show()
In [124]:
#Create a dataset excluding income so we can compare the rest of the variables to income
all_col = num_cols_no_response.columns
except_income = [col for col in all_col if col != 'Income']

# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

# Create scatter plots for each variable
for i, variable in enumerate(except_income):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.scatterplot(data=data_dropped, x='Income', y=variable)
    plt.title(variable)

plt.tight_layout()
plt.show()
In [126]:
data_dropped = data_dropped[(data_dropped['Income'] <= 150000)]

Observations and Insights from the Data overview:¶

  • Not surprisingly, income correlates highly with almost all of the purchase variables.
  • Notably, income has a roughly neutral correlation with number of discount purchases, which makes sense: those with less income look for more ways to maximize their money.
  • Number of discount purchases correlates more strongly with number of web visits, which may mean the deals are mostly online. Number of web visits is strongly negatively correlated with income. I don't think people with lower income are more likely to own a computer and shop online, so I'm not sure why that would be.
  • The customers who report making $160k make an unusually low number of purchases in everything but meat, so I think they may be mistakes as well, and I don't think they will improve our analysis. I took them out of the dataset along with the other outliers.
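The income observations above can be backed with numbers by ranking every numeric column by its correlation with Income (on the real data, data_dropped.corr()['Income'] does this in one call). This sketch uses fabricated stand-in data; the column names mirror the dataset but the values are synthetic:

```python
import numpy as np
import pandas as pd

# Fabricated stand-in: one income-driven spend column and one deal column
# that moves against income, mimicking the pattern in the heatmap
rng = np.random.default_rng(0)
income = rng.normal(52000, 20000, 500).clip(5000)
df = pd.DataFrame({
    'Income': income,
    'MntWines': (income * 0.004 + rng.normal(0, 30, 500)).clip(0),
    'NumDealsPurchases': (8 - income / 10000 + rng.normal(0, 1, 500)).clip(0),
})

# Rank the remaining columns by their correlation with Income
corr_with_income = df.corr()['Income'].drop('Income').sort_values(ascending=False)
print(corr_with_income)
```

A sorted correlation series like this makes it easy to spot which variables move with income and which, like discount purchases, move against it.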

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Imputing missing values

Think About It:

  • Can we extract the age of each customer and create a new feature?
  • Can we find the total kids and teens in the home?
  • Can we find out how many members each family has?
  • Can we find the total amount spent by the customers on various products?
  • Can we find out how long the customer has been with the company?
  • Can we find out how many offers the customers have accepted?
  • Can we find out amount spent per purchase?
In [127]:
# Create a new dataframe to add additional columns
df_feat_eng = data_dropped.copy()

# Determine age by subtracting birth year from the collection year
# (per the note above, the data is assumed to be collected in 2016)
df_feat_eng['Age'] = 2016 - df_feat_eng['Year_Birth']

# Determine the amount of children in a home by adding kids and teens
df_feat_eng['Dependents'] = df_feat_eng['Kidhome'] + df_feat_eng['Teenhome']
In [128]:
# Create a function to calculate the total household
# (note: `== ('Married' or 'Together')` only ever matched 'Married',
# so Together households were undercounted by one adult)

def calc_household(row):
    if row['Marital_Status'] in ('Married', 'Together'):
        return row['Dependents'] + 2
    else:
        return row['Dependents'] + 1
In [129]:
# Apply the function to the whole dataset and plot the counts
df_feat_eng['Total_Household'] = df_feat_eng.apply(calc_household, axis=1)
df_feat_eng['Total_Household'].value_counts().plot(kind='bar')
plt.show()
In [130]:
# Add together all purchases that are necessities
df_feat_eng['total_spent_needs'] = (df_feat_eng['MntFishProducts'] + df_feat_eng['MntMeatProducts']
                                    + df_feat_eng['MntFruits']) 

# Add together all purchases that are luxuries
df_feat_eng['total_spent_luxuries'] = (df_feat_eng['MntSweetProducts'] + df_feat_eng['MntWines'] 
                                    + df_feat_eng['MntGoldProds'])

# See how many total offers a customer has accepted
df_feat_eng['total_accepted_offers'] = (df_feat_eng['AcceptedCmp1'] + df_feat_eng['AcceptedCmp2']
                              + df_feat_eng['AcceptedCmp3'] + df_feat_eng['AcceptedCmp4'] 
                              + df_feat_eng['AcceptedCmp5'])

# Count the number of transactions the customer has made over the various mediums
df_feat_eng['total_amt_purchases'] = (df_feat_eng['NumCatalogPurchases'] + df_feat_eng['NumStorePurchases'] 
                                      + df_feat_eng['NumWebPurchases'])

# Calculate the average purchase price by adding all spending and dividing by the count of
# transactions; a zero transaction count becomes NaN to avoid division by zero
df_feat_eng['avg_purchase_price'] = round((df_feat_eng['total_spent_needs'] + df_feat_eng['total_spent_luxuries']) 
                                          / df_feat_eng['total_amt_purchases'].replace(0, np.nan),2)
In [131]:
# Calculate how many days customers had been enrolled; a fixed reference date
# (assumed here: end of 2016, the assumed collection year) keeps this reproducible
reference_date = datetime(2016, 12, 31)
def round_to_days(date):
    return (reference_date - date).days

df_feat_eng['days_enrolled'] = df_feat_eng['Dt_Customer'].apply(round_to_days)
In [132]:
# Create a dataframe of just the new columns we made to look at their distributions
new_cols = df_feat_eng[['total_spent_needs', 'total_spent_luxuries', 'total_amt_purchases', 
                        'avg_purchase_price','total_accepted_offers', 'days_enrolled']]
# Calculate the number of rows needed for the plots (each column needs 2 rows)
n_cols = 2
n_rows = len(new_cols.columns)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 3))

# Plot box plots and histograms
for i, column in enumerate(new_cols.columns):
    # Box plot
    sns.boxplot(x=new_cols[column], ax=axes[i, 0])
    axes[i, 0].set_title(f'Box Plot {column}')
    
    # Histogram
    sns.histplot(new_cols[column], kde=False, ax=axes[i, 1])
    axes[i, 1].set_title(f'Histogram {column}')
plt.tight_layout()

# Show the plot
plt.show()

Important Insights from EDA and Data Preprocessing¶

What are the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

  • I believe our new columns express the information we want to look at more efficiently.
  • We see a more substantial tail for customers purchasing non-necessities, indicating that when people have excess money, they'll spend more on luxuries.
  • An overwhelming majority of our customers haven't accepted a single campaign offer.
  • Overall, I think the main variable we will need to look at is income; it drives so much of customer behavior.

Data Preparation for Segmentation¶

  • The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
  • Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
  • Plot the correlation plot after we've removed the irrelevant variables
  • Scale the Data
In [133]:
# Create a dataframe of the behavioral variables that will serve as segmentation attributes
seg_att = df_feat_eng[['NumDealsPurchases', 'NumWebVisitsMonth','Complain', 'Response',
                       'total_spent_needs', 'total_spent_luxuries','total_accepted_offers',
                       'NumWebPurchases','NumCatalogPurchases', 'NumStorePurchases']]

# Create a dataframe of the characteristic variables that will serve as our profiling attributes.
prof_att = df_feat_eng[['Education', 'Marital_Status', 'Income','Age', 'Dependents','Total_Household',
                       'avg_purchase_price','days_enrolled']]
prof_categorical = prof_att.select_dtypes(include = ['object']).columns
prof_num = prof_att.select_dtypes(include = ['number']).columns

# Scale the data so we can analyze it
scale = StandardScaler()
X = seg_att 
scaledX = scale.fit_transform(X)

data_scaled = pd.DataFrame(scaledX, columns = seg_att.columns)

plt.figure(figsize  = (12, 10))

sns.heatmap(data_scaled.corr(), annot = False, cmap = "rocket")

plt.show()

Applying t-SNE and PCA to visualize the data in 2 dimensions¶

Applying t-SNE¶

In [134]:
# Plot the t-SNE embedding for increasing perplexity values in a for loop
for i in range(10, 50, 5):
    tsne = TSNE(n_components = 2, random_state = 1, perplexity = i)
    
    data_tsne = tsne.fit_transform(scaledX)
    
    data_tsne = pd.DataFrame(data_tsne)
    
    data_tsne.columns = ['X1', 'X2']
    
    plt.figure(figsize = (5,5))
    
    sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
    
    plt.title("perplexity = {}".format(i))

Observation and Insights:

  • There isn't much we can take away from the t-SNE visualizations: lots of little clusters, one big diffuse cluster, and one more reasonably sized cluster.
  • t-SNE is less helpful when the clusters aren't equally sized.
  • These are probably very complicated clusters.

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?
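One way to ground this decision, sketched below, is to keep the smallest number of components whose cumulative explained variance crosses a chosen threshold (the helper name is ours):

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for_variance(X, threshold=0.90, random_state=1):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold` (for threshold < 1)."""
    pca = PCA(random_state=random_state).fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # argmax returns the first index where the threshold is crossed
    return int(np.argmax(cumulative >= threshold)) + 1
```

Applying PCA first decorrelates and denoises the inputs, which tends to help distance-based clustering; n_components_for_variance(scaledX) would mirror the 90% threshold used later in the notebook.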
In [139]:
# Defining the number of principal components to generate
n = scaledX.shape[1]

# Finding principal components for the data
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(scaledX))

# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_

plt.figure(figsize = (10, 10))

plt.plot(range(1, n+1),pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')

plt.title("Explained Variances by Components")

plt.xlabel("Number of Components")

plt.ylabel("Cumulative Explained Variance")

plt.show()
In [140]:
# Plot the principal values in relation to the numeric variables
n = 6
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(scaledX))
cols = [f'PC{i}' for i in range(1, n + 1)]
pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:len(cols)], 2), index = seg_att.columns, 
                   columns = cols)

# Create subplots
fig, axs = plt.subplots(len(cols), 1, figsize=(10, 4 * len(cols)))

# Plot each principal component as a separate subplot
for i, col in enumerate(cols):
    pc1[col].plot(kind='bar', ax=axs[i])
    axs[i].set_title(f'Principal Component Analysis: {col}')
    axs[i].set_xlabel('Variables')
    axs[i].set_ylabel('Principal Component Values')
    axs[i].set_xticklabels(pc1.index, rotation=45)

plt.tight_layout()
plt.show()

Observation and Insights:

I chose 6 principal components because they explain over 90% of the variance, and testing more or fewer components showed no additional benefit.

  • PC1 emphasizes all the purchasing behavior except NumDealsPurchases.
  • PC2 focuses on deals purchases, web visits, and web purchases.
  • PC3 focuses on responses and total accepted offers
  • PC4 focuses solely on complaints
  • PC5 avoids total accepted offers and focuses on response and deals purchased.
  • PC6 focuses on deals purchased and accepted offers but avoids web purchases.
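These readings can be generated programmatically instead of eyeballing the bar charts. A sketch (the helper name is ours; it assumes the fitted pca1 and the seg_att columns from the cells above):

```python
import numpy as np

def top_loadings(pca, feature_names, n_top=3):
    """For each principal component, return the `n_top` features with
    the largest absolute loadings, signed so direction is preserved."""
    out = {}
    for i, component in enumerate(pca.components_):
        # Indices of the strongest loadings, largest magnitude first
        order = np.argsort(np.abs(component))[::-1][:n_top]
        out[f"PC{i + 1}"] = {feature_names[j]: round(float(component[j]), 2)
                             for j in order}
    return out
```

top_loadings(pca1, list(seg_att.columns)) returns, for each PC, its strongest signed loadings.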

K-Means¶

Think About It:

  • How do we determine the optimal K value from the elbow curve?
  • Which metric can be used to determine the final K value?
In [141]:
sse = {} 

# Iterate for a range of Ks and fit the scaled data to the algorithm. 
# Use inertia attribute from the clustering object and store the inertia value for that K 
for k in range(1, 16):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    
    sse[k] = kmeans.inertia_

# Elbow plot
plt.figure()

plt.plot(list(sse.keys()), list(sse.values()),'-bo')

plt.xlabel("Number of cluster")

plt.ylabel("SSE")

plt.show()
In [143]:
# Empty dictionary to store the Silhouette score for each value of K
sc = {} 

# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K 
for k in range(2, 16):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    
    labels = kmeans.predict(data_pca)
    
    sc[k] = silhouette_score(data_pca, labels)

# Elbow plot
plt.figure()

plt.plot(list(sc.keys()), list(sc.values()), '-bo')

plt.xlabel("Number of cluster")

plt.ylabel("Silhouette Score")

plt.show()

Observation and Insights:

  • I chose 6 clusters because the elbow is around 5 or 6 and the silhouette score was higher at 6 than at 5.
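The elbow can also be located with a simple geometric heuristic rather than by eye: pick the K whose point on the SSE curve lies farthest from the straight line joining the curve's endpoints. A sketch (the helper name is ours; sse is the dictionary built above):

```python
import numpy as np

def find_elbow(ks, sses):
    """Return the k whose (k, sse) point is farthest from the line
    through the first and last points of the elbow curve."""
    ks = np.asarray(ks, dtype=float)
    sses = np.asarray(sses, dtype=float)
    p1 = np.array([ks[0], sses[0]])
    p2 = np.array([ks[-1], sses[-1]])
    line = p2 - p1
    line /= np.linalg.norm(line)
    # Perpendicular distance of every point from the chord p1 -> p2
    vecs = np.column_stack([ks, sses]) - p1
    proj = np.outer(vecs @ line, line)
    dists = np.linalg.norm(vecs - proj, axis=1)
    return int(ks[np.argmax(dists)])
```

find_elbow(list(sse.keys()), list(sse.values())) should land near the visual elbow; the silhouette score then breaks ties between neighboring K values, as done here.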

Applying KMeans on the PCA data and visualize the clusters¶

In [144]:
# Define the number of clusters I will use throughout the notebook
n_clusters = 6

# Initialize the KMeans algorithm
kmeans_pca = KMeans(n_clusters = n_clusters, init = 'k-means++', random_state = 1)
In [145]:
# Fit the pca data with kmeans
kmeans_pca.fit(data_pca)

# Add the labels to my data frame
df_feat_eng['KmeansLabels'] = kmeans_pca.labels_

# Define the color palette I want to use throughout the notebook
color_palette = sns.color_palette("colorblind", df_feat_eng['KmeansLabels'].nunique())

# Show the count of each cluster
df_feat_eng['KmeansLabels'].value_counts()
Out[145]:
4    921
1    479
0    452
2    173
3    160
5     20
Name: KmeansLabels, dtype: int64
In [146]:
num_pcs = data_pca.shape[1]  # Number of principal components
plot_index = 1

plt.figure(figsize=(20,num_pcs * 10))

# Create a for loop to compare all the PCs with each other and visualize the clusters
for i in range(num_pcs):
    for j in range(i + 1, num_pcs):
        plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
        sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j], 
                        hue=df_feat_eng['KmeansLabels'],palette=color_palette)
        plt.xlabel(f'PC {i+1}')
        plt.ylabel(f'PC {j+1}')
        plt.title(f'Principal Components {i+1} vs {j+1}')
        plot_index += 1
        
        
plt.tight_layout()
plt.show()

Cluster Profiling¶

In [147]:
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
    
data_tsne = tsne.fit_transform(scaledX)
    
data_tsne = pd.DataFrame(data_tsne)
    
data_tsne.columns = ['X1', 'X2']
    
plt.figure(figsize = (5,5))
    
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['KmeansLabels'],palette=color_palette)
    
plt.title("perplexity = 15")
plt.show()
In [148]:
# Use the segmentation attributes as the numeric columns for cluster profiling
num_col = seg_att

# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
    plt.subplot(len(num_col.columns), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmeansLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [149]:
# List the segmentation variables we want to compare against Income
except_income = [col for col in num_col if col != 'Income']
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['KmeansLabels'].nunique())

for i, variable in enumerate(except_income):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['KmeansLabels'], 
                    palette=color_palette)
    plt.title(variable)

plt.tight_layout()
plt.show()
In [150]:
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

for h,i in enumerate(prof_categorical):
    grouped_counts = df_feat_eng.groupby(['KmeansLabels',i]).size().reset_index(name='counts')
    plt.subplot(n_rows, n_cols, h + 1)
    sns.barplot(data = grouped_counts, x=i, y = 'counts', hue = 'KmeansLabels', palette = color_palette)
    plt.xlabel(i)
    plt.ylabel('Counts')
    plt.title(f'KMeans Cluster Profile by {i}')
    plt.legend(title=i)
plt.tight_layout()
plt.show()
In [160]:
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
    clusters = sorted(df['KmeansLabels'].unique())
    n_clusters = len(clusters)
    n_cols = len(columns)
    
    plt.figure(figsize=(15, n_clusters * 5))

    for idx, cluster in enumerate(clusters):
        for jdx, column in enumerate(columns):
            cluster_data = df[df['KmeansLabels'] == cluster]
            counts = cluster_data[column].value_counts()
            
            plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
            plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
            plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')

    plt.tight_layout()
    plt.show()

# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']

# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
In [151]:
# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
    plt.subplot(len(prof_num), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmeansLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observations and Insights:

The clustering isolated the people who complain into one very small cluster. Age wasn't a very helpful profiling attribute because every cluster has a similar range.

Describe the characteristics of each cluster¶

Think About It:

  • Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using?
  • What is the next step to get more meaningful insights?

Summary of each cluster:

  • Cluster 0: 452 customers. Middle of the road income (from 30k to 80k) with outliers at the high end and low end. Most have 1 child. They look for the discount purchases and visit the website. Highest percentage of divorced customers
  • Cluster 1: 479 customers. Second highest average income. Most have no children or one child. This group is most likely to make their purchases in the store and spend the most (tied with group 2) on needs on average.
  • Cluster 2: 173 customers. Highest income group, up to 100k with an average around 80k. Most have no children, with some outliers. This group accepted the most campaign offers; it also spent the most on luxuries and has the highest average purchase price.
  • Cluster 3: 160 customers. Lower income, higher proportion of single people and highest proportion of PhDs
  • Cluster 4: 921 customers. Lower income, spends very little overall but visits the website a lot. This group is the most recent to have enrolled. They have the lowest purchase price as well. Highest proportion of basic education
  • Cluster 5: 20 customers. Lower income, this is the group that complains. 14/20 are graduates.

K-Medoids¶

In [162]:
kmedo = KMedoids(n_clusters = n_clusters, random_state = 1)

kmedo.fit(data_pca)

df_feat_eng['KmedoLabels'] = kmedo.predict(data_pca)
In [163]:
df_feat_eng['KmedoLabels'].value_counts()
Out[163]:
1    498
5    497
0    385
4    328
2    309
3    188
Name: KmedoLabels, dtype: int64
In [164]:
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
    
data_tsne = tsne.fit_transform(scaledX)
    
data_tsne = pd.DataFrame(data_tsne)
    
data_tsne.columns = ['X1', 'X2']
    
plt.figure(figsize = (5,5))
    
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['KmedoLabels'],palette=color_palette)
    
plt.title("perplexity = 15")
plt.show()

Visualize the clusters using PCA¶

In [165]:
# Compare every pair of principal components, colored by K-Medoids cluster
num_pcs = data_pca.shape[1]  # Number of principal components
plot_index = 1

plt.figure(figsize=(20,num_pcs * 10))

for i in range(num_pcs):
    for j in range(i + 1, num_pcs):
        plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
        sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j], 
                        hue=df_feat_eng['KmedoLabels'],palette=color_palette)
        plt.xlabel(f'PC {i+1}')
        plt.ylabel(f'PC {j+1}')
        plt.title(f'Principal Components {i+1} vs {j+1}')
        plot_index += 1
        
        
plt.tight_layout()
plt.show()

Cluster Profiling¶

In [166]:
# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
    plt.subplot(len(num_col.columns), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmedoLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [167]:
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['KmedoLabels'].nunique())

for i, variable in enumerate(except_income):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['KmedoLabels'], 
                    palette=color_palette)
    plt.title(variable)

plt.tight_layout()
plt.show()
In [168]:
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

for h,i in enumerate(prof_categorical):
    grouped_counts = df_feat_eng.groupby(['KmedoLabels',i]).size().reset_index(name='counts')
    plt.subplot(n_rows, n_cols, h + 1)
    sns.barplot(data = grouped_counts, x=i, y = 'counts', hue = 'KmedoLabels', palette = color_palette)
    plt.xlabel(i)
    plt.ylabel('Counts')
    plt.title(f'K Medoids Cluster Profile by {i}')
    plt.legend(title=i)
plt.tight_layout()
plt.show()
In [169]:
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
    clusters = sorted(df['KmedoLabels'].unique())
    n_clusters = len(clusters)
    n_cols = len(columns)
    
    plt.figure(figsize=(15, n_clusters * 5))

    for idx, cluster in enumerate(clusters):
        for jdx, column in enumerate(columns):
            cluster_data = df[df['KmedoLabels'] == cluster]
            counts = cluster_data[column].value_counts()
            
            plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
            plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
            plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')

    plt.tight_layout()
    plt.show()

# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']

# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
In [170]:
# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
    plt.subplot(len(prof_num), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmedoLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observations and Insights:

K-Medoids distributed customers more evenly across the clusters, but the clusters are less distinct. It also spread the complainers across the clusters instead of isolating them.

Characteristics of each cluster¶

Summary for each cluster:

  • Cluster 0: 498 customers. Second highest income, most have exactly one kid Second highest average purchase price. Highest proportion of PhDs. Most likely to make their purchase in the store.
  • Cluster 1: 497 customers. Highest income, most have no kids. Highest average purchase price. Close second on proportion of PhDs. Most likely to respond and accept an offer. Most spent on luxuries and needs.
  • Cluster 2: 385 customers. Lowest income, most have one or two kids. Highest proportion of married customers.
  • Cluster 3: 328 customers. Middle of the road income, most have one or two kids. This group doesn’t make a lot of purchases.
  • Cluster 4: 309 customers. Middle of the road income, most have one or two kids. Highest proportion of divorces. Shopped for the most deals.
  • Cluster 5: 188 customers. Lowest income, most have one or two kids. Highest proportion of single customers.


Hierarchical Clustering¶

  • Find the Cophenetic correlation for different distances with different linkage methods.
  • Create the dendrograms for different linkages
  • Explore different linkages with each distance metric
In [171]:
hc_df = data_pca.copy()
In [172]:
hc_df1 = hc_df.copy()
In [173]:
# List of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df1, metric = dm, method = lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
            
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.6868745753666708.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7301698171979875.
Cophenetic correlation for Euclidean distance and average linkage is 0.8611556779862961.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.7951552210547905.
Cophenetic correlation for Chebyshev distance and single linkage is 0.642375507474261.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6057135793014059.
Cophenetic correlation for Chebyshev distance and average linkage is 0.8586546275910416.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.7252606349299516.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6800460908323281.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.664531085942267.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8221753980918977.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7771600130378179.
Cophenetic correlation for Cityblock distance and single linkage is 0.733563535927652.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7125179768514398.
Cophenetic correlation for Cityblock distance and average linkage is 0.8524722661820792.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.8166686006433075.
****************************************************************************************************
Highest cophenetic correlation is 0.8611556779862961, which is obtained with Euclidean distance and average linkage.
In [174]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "centroid","ward", "weighted"] 

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize = (15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df1, metric = "euclidean", method = method)

    dendrogram(Z, ax = axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

    compare.append([method, coph_corr])
In [175]:
# Create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns = compare_cols)

df_cc = df_cc.sort_values(by = "Cophenetic Coefficient", ascending = False)
df_cc
Out[175]:
Linkage Cophenetic Coefficient
2 average 0.861156
3 centroid 0.853083
5 weighted 0.795155
1 complete 0.730170
0 single 0.686875
4 ward 0.564658
In [202]:
HCmodel = AgglomerativeClustering(n_clusters = n_clusters, metric = "euclidean", linkage = "ward")
HCmodel.fit(hc_df1)
Out[202]:
AgglomerativeClustering(metric='euclidean', n_clusters=6)
In [203]:
# Adding hierarchical cluster labels to the original and whole dataframes
hc_df['HCLabels'] = HCmodel.labels_
df_feat_eng['HCLabels'] = HCmodel.labels_
df_feat_eng['HCLabels'].value_counts()
Out[203]:
3    795
0    699
1    417
2    189
4     85
5     20
Name: HCLabels, dtype: int64

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining number of clusters based on the dendrogram?
  • Are there any distinct clusters in any of the dendrograms?
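If we do want to cut the dendrogram directly, SciPy's fcluster turns a linkage matrix into flat labels, either at a distance threshold or at a target cluster count. A sketch (the wrapper name is ours; calling it on hc_df1 with the notebook's n_clusters would parallel the AgglomerativeClustering fit above):

```python
from scipy.cluster.hierarchy import linkage, fcluster

def cut_tree(data, n_clusters, metric="euclidean", method="ward"):
    """Flat cluster labels from a hierarchical linkage, cut so that
    exactly `n_clusters` clusters remain (labels are 1-based)."""
    Z = linkage(data, metric=metric, method=method)
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

With criterion="distance" instead, t becomes the dendrogram height at which to cut.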

Visualize the clusters using PCA¶

In [189]:
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
    
data_tsne = tsne.fit_transform(scaledX)
    
data_tsne = pd.DataFrame(data_tsne)
    
data_tsne.columns = ['X1', 'X2']
    
plt.figure(figsize = (5,5))
    
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['HCLabels'],palette=color_palette)
    
plt.title("perplexity = 15")
plt.show()
In [190]:
# Compare every pair of principal components, colored by hierarchical cluster
num_pcs = data_pca.shape[1]  # Number of principal components
plot_index = 1

plt.figure(figsize=(20,num_pcs * 10))

for i in range(num_pcs):
    for j in range(i + 1, num_pcs):
        plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
        sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j], 
                        hue=df_feat_eng['HCLabels'],palette=color_palette)
        plt.xlabel(f'PC {i+1}')
        plt.ylabel(f'PC {j+1}')
        plt.title(f'Principal Components {i+1} vs {j+1}')
        plot_index += 1
        
        
plt.tight_layout()
plt.show()

Cluster Profiling¶

In [191]:
# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
    plt.subplot(len(num_col.columns), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['HCLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [192]:
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['HCLabels'].nunique())

for i, variable in enumerate(except_income):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['HCLabels'], 
                    palette=color_palette)
    plt.title(variable)

plt.tight_layout()
plt.show()
In [193]:
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

for h,i in enumerate(prof_categorical):
    grouped_counts = df_feat_eng.groupby(['HCLabels',i]).size().reset_index(name='counts')
    plt.subplot(n_rows, n_cols, h + 1)
    sns.barplot(data = grouped_counts, x=i, y = 'counts', hue = 'HCLabels', palette = color_palette)
    plt.xlabel(i)
    plt.ylabel('Counts')
    plt.title(f'Hierarchical Clustering Profile by {i}')
    plt.legend(title=i)
plt.tight_layout()
plt.show()
In [194]:
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
    clusters = sorted(df['HCLabels'].unique())
    n_clusters = len(clusters)
    n_cols = len(columns)
    
    plt.figure(figsize=(15, n_clusters * 5))

    for idx, cluster in enumerate(clusters):
        for jdx, column in enumerate(columns):
            cluster_data = df[df['HCLabels'] == cluster]
            counts = cluster_data[column].value_counts()
            
            plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
            plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
            plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')

    plt.tight_layout()
    plt.show()

# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']

# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
In [195]:
# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
    plt.subplot(len(prof_num), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['HCLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observations and Insights:

Hierarchical clustering seemed to have made some identifiable clusters. Cluster 2 seems to be well educated single parents. Cluster 4 seems to be Dual Income No Kids who like to spend money. Cluster 1 might be a more average family who want to get the most bang for their buck.

Characteristics of each cluster¶

Summary of each cluster:

  • Cluster 0: 699 customers. Second highest income with outliers at the high end and the low end. Most have no children or one child. High average purchase price, with a lot of high outliers. Highest percentage of graduation education. This group was the most likely to go into the store to make their purchases.
  • Cluster 1: 417 customers. Middle of the road income. Most have one or two children. Highest proportion of masters. Highest count of discount purchases.
  • Cluster 2: 189 customers. Middle of the road income. Most have exactly one child. Longest average enrollment. Highest proportion of PhDs. Highest percentages of divorced, widowed, and single customers. This group responded to and accepted some offers.
  • Cluster 3: 795 customers. Lowest income. Most have one or two children. Most recent average enrollment. Lowest average purchase price, highest percentage of basic education and lowest PhD percentage.
  • Cluster 4: 85 customers. Highest income. Most have no children. Highest average purchase price. No basic education with a high master and PhD percentage. This group spent the most on luxuries and needs, and also they accepted the most campaign offers. This group was most likely to make their purchases with a catalog.
  • Cluster 5: 20 customers. Middle of the road income. Again this is the group of complainers.

DBSCAN¶

DBSCAN is a powerful algorithm for finding high-density clusters, but the challenge is determining the best hyperparameters. It has two: eps (the neighborhood radius) and min_samples (the minimum number of neighbors for a point to count as a core point).

Since clustering is unsupervised, we cannot tune hyperparameters against a labeled validation set the way we would with a supervised model. The approach we can follow is to try several combinations of values and compute the silhouette score for each.
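A common complement to this grid search is the k-distance heuristic: sort each point's distance to its min_samples-th nearest neighbor and look for a knee in the curve, which suggests a reasonable eps. A sketch (the helper name is ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k=20):
    """Sorted distance from each point to its k-th nearest neighbor.

    Querying the same points we fit on, each point is its own nearest
    neighbor, so this is the distance to the (k-1)-th other point --
    close enough for reading off an eps candidate.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dists, _ = nn.kneighbors(X)
    return np.sort(dists[:, -1])
```

plt.plot(k_distances(dbscan_df, k=20)) produces the curve to inspect; the y-value at the knee is the eps candidate.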

In [62]:
dbscan_df = data_pca.copy()
dbscan_df1 = dbscan_df.copy()
In [63]:
# Initializing lists
eps_value = [2,3]                       # Candidate eps values
min_sample_values = [6,20]              # Candidate min_samples values

# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  

Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶

In [64]:
# Finding the silhouette_score for each of the combination
high_silhouette_avg = 0                                               # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0]                                                     # Assigning 0's to the high_i_j list
key = res.keys()                                                      # Assigning dictionary keys to a variable called key
for i in key:
    z = res[i]                                                        # Assigning dictionary values of each i to z
    for j in z:
        db = DBSCAN(eps = i, min_samples = j).fit(dbscan_df)          # Applying DBScan to each of the combinations in dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype = bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(dbscan_df, labels)           # Finding silhouette score 
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhoutte_score is :",
            silhouette_avg,                                            # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg:                       # Track the best score and its (eps, min_samples) combination
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhoutte_score is : 0.6287446447705786
For eps value =2 For min sample =20 The average silhoutte_score is : 0.6179350807374098
For eps value =3 For min sample =6 The average silhoutte_score is : 0.659899134648781
For eps value =3 For min sample =20 The average silhoutte_score is : 0.6769662970660738
In [65]:
# Printing the highest silhouette score
print(
    "Highest silhouette_avg is {} for eps = {} and min sample = {}".format(
        high_silhouette_avg, high_i_j[0], high_i_j[1]
    )
)
Highest silhouette_avg is 0.6769662970660738 for eps = 3 and min sample = 20
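The grid search above can also be written more compactly with `itertools.product`. This is a minimal self-contained sketch: toy blob data stands in for the notebook's `dbscan_df`, and the parameter values are illustrative, not the notebook's actual grid.

```python
# Hedged sketch: tuning DBSCAN over an (eps, min_samples) grid.
# make_blobs replaces the notebook's dbscan_df so the snippet runs on its own.
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=1)

best = {"score": -1.0, "eps": None, "min_samples": None}
for eps, min_samples in product([0.3, 0.5, 1.0], [5, 10]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    if len(set(labels)) < 2:       # silhouette_score needs at least 2 labels
        continue
    score = silhouette_score(X, labels)
    if score > best["score"]:      # keep the score too, not only the parameters
        best = {"score": score, "eps": eps, "min_samples": min_samples}

print(best)
```

Storing the winning score alongside its parameters avoids the kind of bug where the best score is reported as 0 even though the best combination was found.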
In [66]:
# Applying DBSCAN with eps as 3 and min sample as 20
dbs = DBSCAN(eps = 3, min_samples = 20)
In [67]:
# Add DBSCAN cluster labels to dbscan data
df_feat_eng["DBLabels"] = dbs.fit_predict(dbscan_df1)

# Add DBSCAN cluster labels to whole data
dbscan_df1['DBLabels'] =  dbs.fit_predict(dbscan_df)
In [68]:
df_feat_eng['DBLabels'].value_counts()
Out[68]:
 0    2185
-1      27
Name: DBLabels, dtype: int64

Observations and Insights: DBSCAN is not usable here: it effectively produced a single cluster of 2,185 customers plus 27 noise points (label -1), so it provides no meaningful segmentation.
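When reading DBSCAN output it helps to remember that label -1 is noise, not a cluster, so the "2 clusters" above are really one cluster plus outliers. A minimal sketch of that bookkeeping, on toy data standing in for the notebook's `dbscan_df`:

```python
# Minimal sketch: counting DBSCAN clusters while excluding the noise label (-1).
# Random toy data replaces the notebook's dbscan_df.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# -1 marks noise points, so subtract it from the label count if present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```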

Think about it:

  • Changing the eps and min_samples values will produce different DBSCAN results. Can we try more values for eps and min_samples?

Characteristics of each cluster¶

Summary of each cluster: N/A

Gaussian Mixture Model¶

In [204]:
gmm_df = data_pca.copy()
In [205]:
# Let's apply Gaussian Mixture
gmm = GaussianMixture(n_components = n_clusters, random_state = 1)  # Initializing the Gaussian Mixture algorithm with n_components = 6

gmm.fit(gmm_df)     
Out[205]:
GaussianMixture(n_components=6, random_state=1)
In [206]:
gmm_df["GMM_segments"] = gmm.predict(gmm_df)
df_feat_eng["GMMLabels"] = gmm.predict(data_pca)
df_feat_eng['GMMLabels'].value_counts()
Out[206]:
2    772
0    527
1    505
5    232
3    149
4     20
Name: GMMLabels, dtype: int64
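Unlike K-Means, a Gaussian Mixture gives soft assignments: `predict_proba` returns per-component membership probabilities, and the hard labels used above are simply the most probable component. A small self-contained sketch, with toy blobs standing in for the notebook's `data_pca`:

```python
# Hedged sketch: GMM soft assignments vs. hard labels.
# make_blobs replaces the notebook's PCA-reduced data (data_pca).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, random_state=1).fit(X)
proba = gmm.predict_proba(X)   # shape (n_samples, n_components); rows sum to 1
labels = gmm.predict(X)        # argmax over the component probabilities

print(proba.shape, labels[:5])
```

Inspecting the probabilities can reveal borderline customers who sit between two segments, which a hard label hides.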
In [207]:
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
    
data_tsne = tsne.fit_transform(scaledX)
    
data_tsne = pd.DataFrame(data_tsne)
    
data_tsne.columns = ['X1', 'X2']
    
plt.figure(figsize = (5,5))
    
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['GMMLabels'],palette=color_palette)
    
plt.title("perplexity = 15")
plt.show()

Observations and Insights:

Visualize the clusters using PCA¶

In [208]:
# Assuming data_pca is your DataFrame with principal components and df_feat_eng['GMMLabels'] is your label column.
num_pcs = data_pca.shape[1]  # Number of principal components
plot_index = 1

plt.figure(figsize=(20,num_pcs * 10))

for i in range(num_pcs):
    for j in range(i + 1, num_pcs):
        plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
        sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j], 
                        hue=df_feat_eng['GMMLabels'],palette=color_palette)
        plt.xlabel(f'PC {i+1}')
        plt.ylabel(f'PC {j+1}')
        plt.title(f'Principal Components {i+1} vs {j+1}')
        plot_index += 1
        
        
plt.tight_layout()
plt.show()

Cluster Profiling¶

In [209]:
# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
    plt.subplot(len(num_col.columns), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['GMMLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [210]:
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['GMMLabels'].nunique())

for i, variable in enumerate(except_income):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['GMMLabels'], 
                    palette=color_palette)
    plt.title(variable)

plt.tight_layout()
plt.show()
In [211]:
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2

plt.figure(figsize=(15, n_rows * 5))

for h,i in enumerate(prof_categorical):
    grouped_counts = df_feat_eng.groupby(['GMMLabels',i]).size().reset_index(name='counts')
    plt.subplot(n_rows, n_cols, h + 1)
    sns.barplot(data = grouped_counts, x=i, y = 'counts', hue = 'GMMLabels', palette = color_palette)
    plt.xlabel(i)
    plt.ylabel('Counts')
    plt.title(f'GMM Cluster Profile by {i}')
    plt.legend(title='GMMLabels')
plt.tight_layout()
plt.show()
In [212]:
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
    clusters = sorted(df['GMMLabels'].unique())
    n_clusters = len(clusters)
    n_cols = len(columns)
    
    plt.figure(figsize=(15, n_clusters * 5))

    for idx, cluster in enumerate(clusters):
        for jdx, column in enumerate(columns):
            cluster_data = df[df['GMMLabels'] == cluster]
            counts = cluster_data[column].value_counts()
            
            plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
            plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
            plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')

    plt.tight_layout()
    plt.show()

# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']

# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
In [213]:
# Set plot size
plt.figure(figsize = (15, 30))

# Standardize the order of box plots
group_order = range(n_clusters)

# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
    plt.subplot(len(prof_num), 2, i + 1)
    
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['GMMLabels'], order = group_order)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observations and Insights:

GMM produced some interesting clusters. Every method keeps finding one very large cluster of low-income customers who make almost no purchases with the company; for GMM that is group 2. GMM also isolated the high-income, high-spending customers without children (group 5), and the families hunting for discounts on moderate incomes (group 0).

Characteristics of each cluster¶

Summary of each cluster:

  • Cluster 0: 527 customers. Middle of the road income, most have one or two children. High amount of discount purchases, made the most visits to the web.
  • Cluster 1: 505 customers. Second highest income, most have no children or one child. Highest percentage of graduation education. This group is most likely to make catalog purchases and store purchases. They spend the second most on luxuries and needs.
  • Cluster 2: 772 customers. Lowest Income, most have one or two children. Highest percentage of basic education, lowest PhDs. This group has been enrolled the shortest. This group has spent the least on luxuries and needs. This group doesn’t make very many purchases.
  • Cluster 3: 149 customers. Second lowest income, most have one or two children. Highest percentage of PhDs and Masters combined. Very high percentage of single, divorced and widowed. High amount of discount purchases. This group responded and accepted some offers. This group has been enrolled the longest.
  • Cluster 4: 20 customers. Lower income; most have one or two children. This cluster contains all of the customers who complained.
  • Cluster 5: 232 customers. Highest income. Almost all have no children. Very high percentage of single, divorced, and widowed. Made very few, if any, discount purchases. This group responded to and accepted the most offers. This group spent the most on luxuries and needs and had the highest average purchase price.
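The per-cluster summaries above come down to grouped aggregates. A minimal sketch of that profiling step with pandas; the column names and values here are illustrative, not the notebook's actual schema:

```python
# Minimal sketch of cluster profiling: per-cluster sizes and means via groupby.
# The DataFrame here is toy data, not the notebook's df_feat_eng.
import pandas as pd

df = pd.DataFrame({
    "GMMLabels": [0, 0, 1, 1, 2, 2],
    "Income":    [40000, 45000, 80000, 85000, 20000, 22000],
    "Kids":      [2, 1, 0, 0, 2, 3],
})

# One row per cluster: how many customers, and their average income and kids
profile = df.groupby("GMMLabels").agg(
    n_customers=("Income", "size"),
    mean_income=("Income", "mean"),
    mean_kids=("Kids", "mean"),
)
print(profile)
```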

Method Evaluation¶

In [214]:
kmeans = KMeans(n_clusters = n_clusters, random_state = 1, n_init = 'auto')        # Initializing K-Means with n_clusters = 6 and random_state = 1

preds = kmeans.fit_predict((data_pca))                   # Fitting and predicting K-Means on data_pca

score = silhouette_score(data_pca, preds)                # Calculating the silhouette score

print(score)         
0.39032909475551403
In [215]:
kmedoids = KMedoids(n_clusters = n_clusters, random_state = 1)   # Initializing K-Medoids with n_clusters = 6 and random_state = 1

preds = kmedoids.fit_predict((data_pca))                # Fitting and predicting K-Medoids on data_pca

score = silhouette_score(data_pca, preds)               # Calculating the silhouette score

print(score)   
0.12392534797169114
In [216]:
# Initializing Agglomerative Clustering with Euclidean distance, ward linkage, and n_clusters = 6
HCmodel = AgglomerativeClustering(n_clusters = n_clusters, metric = "euclidean", linkage = "ward")

# Fitting on PCA data
preds = HCmodel.fit_predict(data_pca)  

score = silhouette_score(data_pca, preds)             # Calculating the silhouette score

print(score)   
0.30535809430078065
In [217]:
# Initializing Gaussian Mixture algorithm with n_components = 6 and random_state = 1
gmm = GaussianMixture(n_components=n_clusters, random_state=1)   

# Fitting and predicting Gaussian Mixture algorithm on data_pca
preds = gmm.fit_predict((data_pca))

# Calculating the silhouette score
score = silhouette_score(data_pca, preds)

# Printing the score
print(score)
0.3320312327281025
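The four evaluation cells above can be collapsed into one comparison loop. This is a hedged sketch on toy data: blobs replace `data_pca`, and K-Medoids (which lives in scikit-learn-extra) is omitted to keep the snippet to plain scikit-learn.

```python
# Hedged sketch: comparing clustering methods by silhouette score in one loop.
# make_blobs replaces the notebook's data_pca; KMedoids is omitted here.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, random_state=1)
n_clusters = 4

models = {
    "KMeans": KMeans(n_clusters=n_clusters, random_state=1, n_init=10),
    "Hierarchical": AgglomerativeClustering(n_clusters=n_clusters),
    "GMM": GaussianMixture(n_components=n_clusters, random_state=1),
}

scores = {}
for name, model in models.items():
    preds = model.fit_predict(X)            # all three expose fit_predict
    scores[name] = silhouette_score(X, preds)

print(scores)
```

A single dictionary of scores also makes it easier to justify the model choice in the conclusion, since silhouette is only one input alongside interpretability.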

Conclusion and Recommendations¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

It was interesting to see how each method performed and the clusters it found. K-Means had the highest silhouette score, but I could not identify the underlying patterns that made its clusters. Hierarchical clustering and GMM produced more sensible clusters. K-Medoids was the least useful, since its clusters had no defining features that would be helpful.

2. Refined insights:

  • What are the most meaningful insights from the data relevant to the problem?

I think the most meaningful insight is that a large swath of this company's customer base is low income and makes very few purchases. No offense to these customers, but they just aren't where the money is. I would put more effort into identifying and attracting customers that fit the higher-spending clusters we identified today.

3. Proposal for the final solution design:

  • What model do you propose to be adopted? Why is this the best solution to adopt?

I suggest the company adopt the Hierarchical Clustering model. I was choosing between it and the GMM model; while GMM has a higher silhouette score and a slightly more balanced distribution, I think the hierarchical clusters make the most sense and give the clearest direction for where to focus resources. Cluster 4 is the cluster the company needs to mine and grow. Take its profile and use it to target advertising toward potential customers who would fit into that cluster. Then, after scouting out more customers like that, focus on in-store promotions for cluster 0, since that is a high-income, high-spending cluster with a larger number of customers in it.

To summarize, we want to grow by spending money on advertising to attract leads that are similar to the cluster 4 profile, and we want to focus our customer retention efforts on customers that fit into cluster 0.

Recommendations for Implementation¶

As evidenced above, we have wasted a lot of resources on marketing campaigns that are ignored by our customers because we're not listening to them. We need to hear what they say and how they behave and meet them where they are. To do that here are our proposed actionable steps:

  • Advertising department: Initiate advertising campaigns to grow cluster 4. Invest in Instagram and Facebook ads that are able to target users that fit this cluster profile. In those ads, we recommend emphasizing more of luxury products, including wines, sweets, and gold products.
  • Marketing department: we recommend creating a marketing campaign to target those in cluster 1 and encourage their loyalty. They should feel appreciated and given ample reasons to keep coming back. This could be a customer appreciation week, loyalty tote bags, or special loyalty sales that only show up in our app.
  • Data science department: we need to do research about what is and what is not worth discounting at our stores through sales and coupons. We want to advertise coupons and sales to the people who are most likely going to use them otherwise they are useless. We should run experiments to test at what prices it takes to get those in cluster 1 to make purchases and then calculate whether the promotions are worth it.
  • Data collection team: To better understand our customers and their habits we should start tracking time of purchases. We can look at receipts and determine when customers were here and when certain products are being purchased. This will give us much better information for stocking, promotions, and staffing.
  • Customer support department: Reach out to the customers who complained to offer a promotion of some sort to increase favor with them. There are not very many complaints right now so this should be doable, if the complaints increase for some reason we will reevaluate.

The benefits of this solution are that we will be much more efficient with our resources. Under our prior strategy, our accepted-campaign rate was roughly 9% because so many offers went to people who were never going to purchase anything in the first place. By using our clusters, we believe the accepted-offer rate will at least double, to almost 20%. We will reach fewer people, but every dollar spent will have a higher ROI. Our loyalty campaign for cluster 0 is estimated to increase retention by 10% and thus overall revenue by 3%.

The costs of this solution are a slight loss in revenue from cluster 3, since we will not advertise to them as much (though overall that will save us money), and an increased budget for the data collection team, which will need extra time to institute the system for tracking time of purchase.

The risk of our solution is that we are focusing on a relatively small group of customers. This group has only 85 customers in it right now, so we cannot be sure its behavior is scalable. While growing that cluster could prove more difficult than this proposal outlines, I believe it is worth the risk because their impact is comparatively outsized.

We believe that with our proposed solution we can consistently increase profits by focusing on the customers who want to make purchases from our store, but this report does not complete our analysis as a company. In order to consistently increase our profits and grow as a business, we need to keep investing in our understanding of our customers and the transactions they make. We need to continually experiment with prices to find our customers' thresholds and what they value. We need to run campaigns to learn what customers want from us: do they value lower prices, or do they value the experience of shopping in our store more? Do they prefer shopping online or in person? With more information, such as time of purchase and everything else outlined here, we can further optimize our store and continue to grow.
